‘White Wine Quality Analysis’ by Woo-Young Yang

===================================================================================================================

Introduction

White wine has been enjoyed for centuries, and its flavor can make food taste better. The purpose of this project is to find which chemical components influence the quality score of wine, which is used here as a proxy for taste. The project covers univariate, bivariate and multivariate analysis. The dataset comes from the Portuguese “Vinho Verde” wine, which is produced in Minho province in the far north of Portugal. (http://www.vinhoverde.pt/)

Dataset

This dataset contains 4898 observations of 12 variables.

The output variable is ‘quality’, so I will examine the relationship between quality and the other variables.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality      quality.factor quality.factor_ABC    quality.factor_HL
##  Min.   :3.000   3:  20         A:1060             High Quality:1060   
##  1st Qu.:5.000   4: 163         B:2198             Low Quality :3838   
##  Median :6.000   5:1457         C:1640                                 
##  Mean   :5.878   6:2198                                                
##  3rd Qu.:6.000   7: 880                                                
##  Max.   :9.000   8: 175                                                
##                  9:   5

The column ‘X’ is just a row index for each wine, so it is not helpful. ‘quality’ is also more useful in factor form. So the ‘X’ column has been dropped and a ‘quality.factor’ column has been added for better analysis.

Quality ranges from 3 to 9, but there are only 20 wines of quality 3 and only 5 wines of quality 9. So two variables have been added: ‘quality.factor_ABC’, which divides wine quality into 3 classes (A for high, B for middle, and C for low), and ‘quality.factor_HL’, which divides wine quality into 2 classes (high and low). This keeps the few wines of quality 3 and 9 from behaving as outliers. These variables are used after the ‘quality.factor’ analysis.

Several bin ranges were tried, such as ‘A = Quality 8, 9’, ‘B = Quality 5, 6, 7’, ‘C = Quality 3, 4’ and other combinations. But ‘A = Quality 7, 8, 9’, ‘B = Quality 6’, ‘C = Quality 3, 4, 5’ was most suitable for plotting. Although ‘Quality B’ contains only one quality value (6), it is the most frequent one (2198 wines).

For the two-class variable ‘quality.factor_HL’ (‘High Quality’ and ‘Low Quality’), ‘High Quality’ is defined the same as ‘Quality A’ in ‘quality.factor_ABC’ to avoid confusion.
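The derived columns described above could be built roughly as follows. This is a minimal base R sketch, not the original code of this analysis: the helper name `add_quality_factors` is hypothetical, and the toy rows are made-up values used only to show the mapping.

```r
# Sketch: derive the quality factors described above (base R only).
# `add_quality_factors` is a hypothetical helper name.
add_quality_factors <- function(wine) {
  wine$X <- NULL  # drop the row-index column if it is present
  wine$quality.factor <- factor(wine$quality)
  # Right-closed intervals: (2,5] -> C, (5,6] -> B, (6,9] -> A
  abc <- cut(wine$quality, breaks = c(2, 5, 6, 9), labels = c("C", "B", "A"))
  # Reorder the levels so that summary() lists A, B, C
  wine$quality.factor_ABC <- factor(abc, levels = c("A", "B", "C"))
  # "High Quality" is the same set of wines as class A (quality 7 to 9)
  wine$quality.factor_HL <- factor(
    ifelse(wine$quality >= 7, "High Quality", "Low Quality"),
    levels = c("High Quality", "Low Quality")
  )
  wine
}

# Toy rows (made-up, not the real data): one wine per quality score
toy <- data.frame(X = 1:7, quality = 3:9)
toy <- add_quality_factors(toy)
print(toy)
```

With the real data, `summary(wine)` after this step should list the three new factor columns next to the numeric ones, as in the summary above.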

===================================================================================================================

Univariate Plots Section

All plots use 50 bins for better visualization.
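As a rough illustration of what a fixed bin count means, the sketch below counts values in exactly 50 equal-width bins with base R. The helper `bin_counts` and the simulated values are assumptions for illustration; if the plots were drawn with ggplot2, the equivalent setting would likely be `geom_histogram(bins = 50)`.

```r
# Sketch: count observations in exactly `bins` equal-width bins (base R).
# `bin_counts` is a hypothetical helper name.
bin_counts <- function(x, bins = 50) {
  # bins + 1 break points give exactly `bins` intervals
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  hist(x, breaks = breaks, include.lowest = TRUE, plot = FALSE)
}

# Simulated, made-up values for illustration only (not the real data)
set.seed(1)
toy_values <- rexp(500, rate = 1 / 6)
h <- bin_counts(toy_values)
print(length(h$counts))  # number of bins
print(sum(h$counts))     # every observation falls into some bin
```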

This plot shows the count of fixed acidity in wine. Its median is 6.8 and its variance is 0.7121.

This plot shows the count of volatile acidity in wine. Its median is 0.26 and its variance is 0.0102.

This plot shows the count of citric acid in wine. Its median is 0.32 and its variance is 0.0146.

This plot shows the count of residual sugar in wine. Its median is 5.2 and its variance is 25.7258. Residual sugar has a sharp peak between 1 and 2.

This plot shows the count of chlorides in wine. Its median is 0.043 and its variance is 0.0005.

This plot shows the count of free sulfur dioxide in wine. Its median is 34 and its variance is very high (289.2427). But almost every wine contains between 0 and 100 of free sulfur dioxide.

This plot shows the count of total sulfur dioxide in wine. Its median is 134 and its variance is extremely high (1806.0855).

This plot shows the count of density in wine. Its median is 0.9937 and its variance is extremely small.

This plot shows the count of pH in wine. Its median is 3.18 and its variance is 0.0228.

This plot shows the count of sulphates in wine. Its median is 0.47 and its variance is 0.013.

This plot shows the count of alcohol in wine. Its median is 10.4 and its variance is 1.5144.

This plot shows the count of quality in wine. Its median is 6 and its variance is 0.7844.

Almost every wine is rated 5~7. Wines of quality 3 and 9 are extremely rare (20 wines of quality 3, and only 5 wines of quality 9).

Univariate Analysis

Structure of dataset

The dataset is a data frame with 12 variables (excluding the ‘X’ variable). 11 of the variables are input data (chemical composition), and one variable, ‘quality’, is the output.

Main feature of interest

My main feature of interest is quality. Because all the wines come from the same source, other factors such as the producing district cannot be examined.

Other helpful features in the dataset

Variables with large variation might be helpful for the analysis, because this is a classification problem. If the values are concentrated too heavily in a specific range, the analysis becomes harder.

New variables that did not exist in the dataset

I created quality.factor by converting the quality variable to a factor. I also created quality.factor_ABC and quality.factor_HL to simplify the categories and limit the effect of outliers.

Investigating unusual distributions

Residual sugar and chlorides are especially concentrated, while free sulfur dioxide and total sulfur dioxide are widely spread.

I checked the variance of every variable to find which ones might be influential.

[variance list (sorted)]
density                 8.94e-06
chlorides               0.000477
volatile acidity        0.010159
sulphates               0.013
citric acid             0.01464
pH                      0.0228
fixed acidity           0.71211
quality                 0.78
alcohol                 1.51
residual sugar          25.72577
free sulfur dioxide     289.2427
total sulfur dioxide    1806.085

The variance of six variables (density, chlorides, volatile acidity, sulphates, citric acid, pH) is smaller than that of the others, which suggests that those six variables have less influence than the rest. Variance depends on the units of each variable, though, so this comparison is only a rough guide.
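A sorted table of medians and variances like the one above could be produced along these lines. This is a sketch in base R under stated assumptions: `spread_table` is a hypothetical helper name, and the toy data frame holds made-up values, not the real measurements.

```r
# Sketch: median and variance of every numeric column, sorted by variance.
# `spread_table` is a hypothetical helper name.
spread_table <- function(df) {
  num <- df[sapply(df, is.numeric)]  # ignore factor columns
  out <- data.frame(
    median   = sapply(num, median),
    variance = sapply(num, var)
  )
  out[order(out$variance), ]         # smallest variance first
}

# Made-up values for illustration only (not the real data)
toy <- data.frame(
  alcohol = c(9.5, 12.1, 10.4, 11.0),
  density = c(0.991, 0.994, 0.996, 0.993),
  pH      = c(3.1, 3.3, 3.0, 3.2)
)
res <- spread_table(toy)
print(res)
```

Because the variables are measured on different scales, dividing the standard deviation by the mean (the coefficient of variation) would be one way to make the spreads more comparable; that step is not part of the original analysis.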

Residual sugar and chlorides have unusual distributions (a higher and narrower peak than the others, although residual sugar has high variance).

Most wines have quality grades between 5 and 7.

===================================================================================================================

Bivariate Plots Section

First, the relationship between the output (quality) and each of the other variables is checked. Each plot is a boxplot with a line of smoothed conditional means. The Pearson correlation and jittered points are also added for better visualization.
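The Pearson correlations reported for each plot could be computed in one pass, roughly as sketched here. The helper `cor_with_quality` is a hypothetical name and the toy data frame contains made-up values, so the printed numbers are not the ones reported in this section.

```r
# Sketch: Pearson correlation of each input variable with quality.
# `cor_with_quality` is a hypothetical helper name.
cor_with_quality <- function(df, target = "quality") {
  num <- df[sapply(df, is.numeric)]
  inputs <- setdiff(names(num), target)
  r <- sapply(inputs, function(v) {
    cor(num[[v]], num[[target]], method = "pearson")
  })
  r[order(abs(r), decreasing = TRUE)]  # strongest relationship first
}

# Made-up values for illustration only (not the real data)
toy <- data.frame(
  alcohol = c(9.0, 10.0, 11.0, 12.0, 13.0),
  density = c(0.998, 0.996, 0.995, 0.992, 0.991),
  pH      = c(3.2, 3.0, 3.3, 3.1, 3.2),
  quality = c(4, 5, 6, 6, 8)
)
r <- cor_with_quality(toy)
print(round(r, 3))
```

Sorting by absolute value keeps strongly negative relationships, such as the one between quality and density, near the top of the list.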

The correlation between quality and fixed acidity is -0.1136. There is no significant relationship between the two. Quality 6 contains so many wines that its points look very dense even after applying alpha.

The correlation between quality and volatile acidity is -0.1947. There is no significant relationship between the two.

The correlation between quality and citric acid is -0.0092. There is almost no relationship between the two.

The correlation between quality and residual sugar is -0.0975. There is no significant relationship between the two. Residual sugar is concentrated at low values (between 1 and 2).

The correlation between quality and chlorides is -0.2099. This is the first value whose magnitude exceeds 0.2, but it is still not significant.

The correlation between quality and free sulfur dioxide is 0.0081. There is almost no relationship between the two.

The correlation between quality and total sulfur dioxide is -0.1747. There is no significant relationship between the two.

The correlation between quality and density is -0.3071. There is some relationship between the two.

The correlation between quality and pH is 0.0994. There is almost no relationship between the two.

The correlation between quality and sulphates is 0.0536. There is almost no relationship between the two.

The correlation between quality and alcohol is 0.4355. It is the strongest correlation with quality among all variables, which means that the more alcohol a wine contains, the higher its chance of being ranked as high quality. We will check this with further plots below.

This is a density plot of alcohol (the Y-axis is the estimated density, not the ‘density’ variable from the dataset). It is hard to read because of sparse groups such as quality 9. Let’s look at the 3-group and 2-group versions below.

It looks better than the plot above, but let’s look at the 2-group version.

It looks much better. Alcohol clearly separates the quality groups of white wine.

We can see every correlation in this plot. For quality, alcohol (correlation 0.436) and density (correlation -0.307) seem most influential.

Apart from quality, the following pairs seem to be related: alcohol and residual sugar (-0.451), alcohol and total sulfur dioxide (-0.449), alcohol and density (-0.78), pH and fixed acidity (-0.426), density and residual sugar (0.839), density and total sulfur dioxide (0.53), total sulfur dioxide and residual sugar (0.401), and total sulfur dioxide and free sulfur dioxide (0.616).
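The list of related pairs above can also be extracted from the correlation matrix programmatically. The following is a minimal base R sketch: the helper `strong_pairs` and the 0.4 cutoff are assumptions chosen for illustration (every pair listed above has a magnitude above 0.4), and the toy data frame holds made-up values.

```r
# Sketch: list variable pairs whose |Pearson r| exceeds a cutoff.
# `strong_pairs` and the default cutoff are assumptions for illustration.
strong_pairs <- function(df, cutoff = 0.4) {
  m <- cor(df[sapply(df, is.numeric)], method = "pearson")
  # The upper triangle only, so each pair appears once and the diagonal is skipped
  idx <- which(upper.tri(m) & abs(m) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx],
    stringsAsFactors = FALSE
  )
  out[order(abs(out$r), decreasing = TRUE), ]
}

# Made-up values for illustration only (not the real data)
toy <- data.frame(
  alcohol        = c(9.0, 10.0, 11.0, 12.0, 13.0, 12.5),
  density        = c(0.998, 0.996, 0.995, 0.992, 0.991, 0.9915),
  residual.sugar = c(12.0, 8.0, 6.5, 2.0, 1.5, 1.8),
  pH             = c(3.2, 3.0, 3.3, 3.1, 3.2, 3.05)
)
p <- strong_pairs(toy)
print(p)
```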

Bivariate Analysis

Relationships of main feature

The most important variable, ‘quality’, is related to alcohol and density. In terms of correlation with quality, alcohol (0.436) and density (-0.307) seem most influential, followed by chlorides (-0.21), volatile acidity (-0.195), total sulfur dioxide (-0.175), fixed acidity (-0.114) and so on. High quality wines tend to have higher alcohol, lower density, and lower chlorides than low quality wines.

Relationships between the other features

‘alcohol-density’ and ‘density-residual sugar’ are the most strongly related pairs.

Strongest relationship

The strongest relationship is found between ‘density’ and ‘residual sugar’. Its correlation is 0.839.

===================================================================================================================

Multivariate Plots Section

The variables with the highest correlation with quality are plotted first. Alcohol and density have the two highest correlations with quality.

This is a plot of quality, alcohol, and density. Three wines were removed because they are outliers (all three are quality 6). In this plot, the cyan dots (quality 6~7) and purple dots (quality 8) appear slightly more toward the lower right than the green dots (quality 5) and light-green dots (quality 4). But this scatter plot is hard to read because almost 5000 dots are packed together, so further plots are drawn below.

This is a density2d plot for finding where the observations are concentrated. The 9-quality wines appear concentrated in two places, but remember that there are only five 9-quality wines.

So I grouped the wines into 3 groups: A (7~9 quality), B (6 quality) and C (3~5 quality).

It is similar to the first plot above. Now let’s see the density plot again.

The three groups are separated, but the green (B group) dots are spread all over the area. So I drew each group separately in the three plots below.

This is the plot of ‘C group’ wines. They tend to lie toward the upper left.

This is the plot of ‘B group’ wines. They lie everywhere, but slightly more toward the upper left.

This is the plot of ‘A group’ wines. They tend to lie toward the lower right.

I also divided the wines into 2 groups, ‘High quality (7~9)’ and ‘Low quality (3~6)’.

It is easy to see which side each quality group tends toward. Let’s see the density plot.

The high quality wines are clearly concentrated toward the lower right, which corresponds to higher alcohol and lower density. Some outliers among the high quality wines lie toward the upper left.
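One simple way to put numbers on the visual impression from these plots is to compare the centre of each group. The sketch below uses per-group medians of alcohol and density as a stand-in for the peak of a 2D density plot, which is a simplification rather than the method used for the plots; `group_centres` is a hypothetical helper name and the toy values are made up.

```r
# Sketch: median alcohol and density per quality group (base R).
# Medians are a simple stand-in for the centre of a 2D density plot.
# `group_centres` is a hypothetical helper name.
group_centres <- function(df, group) {
  tmp <- data.frame(df[c("alcohol", "density")], grp = df[[group]])
  aggregate(cbind(alcohol, density) ~ grp, data = tmp, FUN = median)
}

# Made-up values for illustration only (not the real data)
toy <- data.frame(
  alcohol = c(9.2, 9.8, 10.1, 12.4, 12.9, 13.1),
  density = c(0.998, 0.997, 0.996, 0.991, 0.990, 0.992),
  quality.factor_HL = c(rep("Low Quality", 3), rep("High Quality", 3))
)
centres <- group_centres(toy, "quality.factor_HL")
print(centres)
```

If the high quality group really sits toward higher alcohol and lower density, its row should show a larger alcohol median and a smaller density median than the low quality row.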

Next, let’s look at plots with other variables.

This plot is drawn with quality, density and residual sugar, because the correlation between density and residual sugar is the largest of all correlations (0.839). Unfortunately, quality and residual sugar have a low correlation (-0.0976).

Let’s see the density plots with the original quality levels (3~9), A/B/C (3 groups), and H/L (2 groups).

Unfortunately, it was hard to find a significant difference between the groups. There seems to be a difference in the first plot (the density plot with the original quality levels), but when the wines are divided into 3 or 2 groups, there are only small differences.

Next, quality, density and chlorides.

It was hard to find a difference.

Next, quality, alcohol and chlorides.

It seems quality differs slightly by chlorides.

For further exploration, I narrowed the dataset to look for another variable that influences quality. The dataset was narrowed using the ‘quality-alcohol-density’ plot.

##  fixed.acidity   volatile.acidity  citric.acid     residual.sugar  
##  Min.   :3.900   Min.   :0.1100   Min.   :0.0000   Min.   : 0.800  
##  1st Qu.:6.100   1st Qu.:0.2500   1st Qu.:0.2800   1st Qu.: 1.600  
##  Median :6.600   Median :0.3000   Median :0.3100   Median : 2.700  
##  Mean   :6.625   Mean   :0.3097   Mean   :0.3263   Mean   : 3.381  
##  3rd Qu.:7.100   3rd Qu.:0.3500   3rd Qu.:0.3600   3rd Qu.: 4.800  
##  Max.   :9.600   Max.   :1.1000   Max.   :1.6600   Max.   :11.250  
##                                                                    
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01400   Min.   : 3.00       Min.   : 19.0       
##  1st Qu.:0.02900   1st Qu.:22.00       1st Qu.: 94.0       
##  Median :0.03300   Median :30.00       Median :111.0       
##  Mean   :0.03408   Mean   :30.03       Mean   :111.8       
##  3rd Qu.:0.03800   3rd Qu.:37.00       3rd Qu.:126.0       
##  Max.   :0.11500   Max.   :74.00       Max.   :294.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9877   Min.   :2.820   Min.   :0.2300   Min.   :12.05  
##  1st Qu.:0.9895   1st Qu.:3.100   1st Qu.:0.3800   1st Qu.:12.30  
##  Median :0.9902   Median :3.200   Median :0.4500   Median :12.50  
##  Mean   :0.9903   Mean   :3.197   Mean   :0.4792   Mean   :12.55  
##  3rd Qu.:0.9911   3rd Qu.:3.290   3rd Qu.:0.5600   3rd Qu.:12.80  
##  Max.   :0.9923   Max.   :3.590   Max.   :1.0800   Max.   :13.20  
##                                                                   
##     quality      quality.factor quality.factor_ABC    quality.factor_HL
##  Min.   :3.000   3:  2          A: 71              High Quality: 71    
##  1st Qu.:6.000   4:  7          B:233              Low Quality :518    
##  Median :7.000   5: 17          C:285                                  
##  Mean   :6.581   6:259                                                 
##  3rd Qu.:7.000   7:233                                                 
##  Max.   :9.000   8: 67                                                 
##                  9:  4

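I changed the grouping because the wines of quality 3, 4 and 5 are almost gone. Judging from the counts in the summary above, the new groups are A = quality 8~9 (71 wines), B = quality 7 (233 wines) and C = quality 3~6 (285 wines).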
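The narrowing and regrouping steps could look roughly like the sketch below. The exact cut-offs used to narrow the dataset are not stated in the text, so the window limits here are assumptions read off the minimum and maximum of alcohol and density in the narrowed summary. The regrouping (A = 8 to 9, B = 7, C = 3 to 6) is inferred from the counts in that summary. `narrow_wine` is a hypothetical helper name and the toy rows are made up.

```r
# Sketch: keep wines inside an alcohol/density window, then regroup quality.
# The window limits are NOT given in the text; the defaults are assumptions
# taken from the min/max values shown in the narrowed summary.
narrow_wine <- function(wine,
                        alcohol_range = c(12.05, 13.20),
                        density_range = c(0.9877, 0.9923)) {
  keep <- wine$alcohol >= alcohol_range[1] & wine$alcohol <= alcohol_range[2] &
    wine$density >= density_range[1] & wine$density <= density_range[2]
  out <- wine[keep, , drop = FALSE]
  # Regrouping inferred from the summary counts: A = 8-9, B = 7, C = 3-6
  out$quality.factor_ABC <- factor(
    ifelse(out$quality >= 8, "A", ifelse(out$quality == 7, "B", "C")),
    levels = c("A", "B", "C")
  )
  out
}

# Made-up rows for illustration only (not the real data)
toy <- data.frame(
  alcohol = c(9.5, 12.5, 12.8, 12.3, 14.0),
  density = c(0.997, 0.990, 0.991, 0.9905, 0.989),
  quality = c(5, 6, 7, 8, 7)
)
narrowed <- narrow_wine(toy)
print(narrowed)
```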

Several variables with higher correlation values were plotted (free sulfur dioxide, total sulfur dioxide, residual sugar, pH, chlorides). Unfortunately, it was hard to find a significant difference.

Multivariate Analysis

Relationships observed in this part

‘Alcohol-density’ has the most significant relationship. It is easy to recognize and easy to plot.

Other interesting interactions between features

Some features such as ‘chlorides’, ‘residual sugar’ and ‘free sulfur dioxide’ were interesting. But I was surprised that ‘residual sugar’ and ‘citric acid’ made no significant difference.

===================================================================================================================

Final Plots and Summary

Plot One

Description One

This plot shows the relationship between ‘Quality’ and ‘Alcohol’. The correlation between ‘Quality’ and ‘Alcohol’ (shown by the blue line) is larger than any other correlation involving ‘Quality’, which is why I chose this plot. It is a boxplot, which carries a lot of information (medians, quartiles and outliers), so the data can be observed effectively. To conclude, white wine with a high alcohol content tends to be ranked as high quality.

Plot Two

Description Two

This plot shows the relationship between ‘Quality’, ‘Alcohol’ and ‘Density’. These three variables show the clearest separation of any combination that includes ‘Quality’. The contrast of colors (High = Red, Middle = Green, Low = Blue) makes the groups easy to tell apart. Based on this plot, white wine from ‘Vinho Verde’ is rated better when it contains more alcohol and has lower density. Some exceptions exist, but the tendency is clear.

I divided quality into ‘High: 7, 8, 9’, ‘Middle: 6’ and ‘Low: 3, 4, 5’. This makes the plots easier to read than ‘Quality 3~9’ or the 2-group ‘High/Low Quality’. ‘High Quality’ contains 1060 wines, ‘Middle Quality’ contains 2198 wines, and ‘Low Quality’ contains 1640 wines.

Plot Three

Description Three

This plot shows the relationship between ‘Quality’, ‘Alcohol’ and ‘Chlorides (Salt)’. I chose a density plot because it is hard to find the points of concentration in a scatter plot. The X-axis is ‘Alcohol’, which clearly separates quality, while differences along the Y-axis are hard to see in a scatter plot. With this density plot, however, we can observe the center of each concentration, which differs by quality along the Y-axis. Wine that contains less chloride tends to be ranked as high quality. Surprisingly, ‘Chlorides (Salt)’ is related to ‘Quality’ more strongly than ‘Residual Sugar’ or any kind of ‘Acidity’.

I divided quality into three classes, the same as in ‘Plot Two’.

===================================================================================================================

Reflection

The quality of white wine is most closely associated with alcohol, density and chlorides. Good white wine tends to have high alcohol, low density and low chlorides. I tried to find in more detail which variables influence quality in the range where high quality wines are concentrated, but it did not go as well as I expected. I was also surprised that sweetness (sugar) and sourness (citric acid) did not influence taste (quality) that much. More datasets, such as datasets from other wineries, would be helpful.
